The Searching Shared HLA Amino Acid Residue Prevalence (SSHAARP) Package processes amino acid alignments produced by the IPD-IMGT/HLA Database to identify user-defined amino acid residue motifs shared across HLA alleles, and calculates the frequencies of those motifs based on HLA allele frequency data. SSHAARP uses Generic Mapping Tools (GMT) software and the GMT R Package to generate global frequency heat maps that illustrate the distribution of each user-defined map around the globe. SSHAARP analyzes the allele frequency data described in Solberg et al. 2008, a global set of 497 population samples from 185 published datasets, representing 66,800 individuals.
Future plans for SSHAARP include producing heat maps based on allelic frequencies obtained from the Allele Frequency Net Database (AFND).
Latest amino acid alignments can be found at https://github.com/ANHIG/IMGTHLA/tree/Latest/alignments.
The Solberg (Solberg et al, “Balancing selection and heterogeneity across the classical human leukocyte antigen loci: A meta-analytic review of 497 population studies”, Hum Immunol (2008), doi: 10.1016/j.humimm.2008.05.001) dataset can be found at http://pypop.org/popdata/ under results.zip.
SSHAARP has three main families of functions that are based on which pieces of information/software they interact with:
IPD-IMGT/HLA:
countSpaces counts the number of whitespaces in the ANHIG alignment files to discern where the alignment start is.
AA_segments_maker extracts alignment sequence information for a given locus from the ANHIG/IMGTHLA Github Repository to produce a dataframe, which contains individual amino acid data for each amino acid position for all alleles, for a user-defined HLA locus or loci.
motif_finder isolates the individual amino acid position dataframe produced by AA_segments_maker to only alleles with the user-defined motif.
Solberg Dataset:
coordinate_converter converts coordinates in the Solberg dataset with cardinal directions to their appropriate enumerations (i.e. 50S is converted to -50)
dataSubset manipulates the Solberg dataset by adding in a new column containing locus*allele information, reordering the dataset based on population name, and subsetting the data to only the locus of interest.
Generic Mapping Tools (GMT)
PALM produces a heatmap from the further manipulated Solberg dataset by using the Generic Mapping Tools (GMT) R Package, which is an interface between R and the GMT Map-Making software. GMT commands are called via a Bash script through using the gmt.system() function in the GMT R package. Users must have the GMT R package AND GMT software installed, which is downloadable here: https://www.soest.hawaii.edu/gmt/.
AA_segments_maker extracts alignment sequence information for a given locus from the ANHIG/IMGTHLA Github Repository to produce a dataframe, which contains individual amino acid data for each amino acid position for all alleles, for a user-defined HLA locus or loci. The first 4 columns are locus, allele, trimmed allele, and allele_name.
Example output of AA_segments_maker for HLA-A, where the selected row outputs are 2, 549, and 800 to illustrate different HLA-A alleles, and selected columns are the first 20 columns of the dataframe, where the first 4 columns consist of the locus, allele, trimmed_allele of 2 fields, and the full allele name. The rest of the columns consist of corresponding amino acids for each amino acid position for a given allele.
>AA_segments_maker("A")[[1]][c(2,8),1:20]
locus allele trimmed_allele allele_name 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2 A 01:01:01:01 A*01:01 A*01:01:01:01 M A V M A P R T L L L L L S G A
549 A 02:01:30 A*02:01 A*02:01:30 M A V M A P R T L V L L L S G A
800 A 02:30:01 A*02:30 A*02:30:01 M A V M A P R T L V L L L S G A
The first row of AA_segments_maker is a correspondence table of amino acid position enumerations. Alignments begin with a negative enumeration for the the amino acid positions within the first exon, which encodes the leader peptide, since the leader peptide is eventually cleaved off. The rest of the amino acid positions associated with exons other than exon 1 have positive enumerations. In addition, the alignment counts do not give amino acid positions with inDels a count. The correspondence table was created to begin enumeration at position 1, since it is more standard, and to account for the skipped over positions with inDels. Below is an example of what the correspondence table looks like – the alignment sequence for the first amino acid postion, -24, corresponds to a more standard beginning enumeration of 1.
>AA_segments_maker("A")[[1]][1,1:20]
locus allele trimmed_allele allele_name 1 2 3 4 5 6 7 8 9 10 11 12
1 <NA> <NA> <NA> <NA> -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13
13 14 15 16
1 -12 -11 -10 -9
###
motif_finder splits up the input motif (should be in format LocusAA_1AA_2AA_3, where the amino acid positions are in numerical order). If the motif is not in numerical order, such as DRB130Y28E26F, the function will rearrange the motif such that the 26F is searched for first. The following output is an example of the 1st, 5th, and 9th DRB1 alleles that have the DRB1*26F28E30Y motif. Recall the dataframe is formatted with actual sequence enumeration (starting with 1), but the search is done by matching the input motif, which is based off of alignment enumeration, with the correspondence table. Thus, in the output dataframe, alignment position 26 is equivalent to actual position 55, 28 to 58, and 30 to 60.
>motif_finder("DRB1*26F~28E~30Y")[c(1,5,9,10), 1:65]
locus allele trimmed_allele allele_name 1 2 3 4 5 6 7 8 9 10 11
1 <NA> <NA> <NA> <NA> -29 -28 -27 -26 -25 -24 -23 -22 -21 -20 -19
197 DRB1 03:02:04 DRB1*03:02 DRB1*03:02:04 * * * * * * * * * * *
240 DRB1 03:38 DRB1*03:38 DRB1*03:38 * * * * * * * * * * *
256 DRB1 03:53 DRB1*03:53 DRB1*03:53 * * * * * * * * * * *
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
1 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8
197 * * * * * * * * * * * * * * * * * * * * * * * R F L
240 * * * * * * * * * * * * * * * * * * * * * * * R F L
256 * * * * * * * * * * * * * * * * * * * * * * * R F L
38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
1 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 InDel_1 27 28 29 30 31
197 E Y S T S E C H F F N G T E R V R F . L E R Y F
240 E Y S T S E C H F F N G T E R V R F . L E R Y F
256 E Y S T S E C H F F N G T E R V R F . L E R Y F
If a motif is not found, an error message is output:
>motif_finder("DRB1*26F~28E~30Z")
Warning message:
In motif_finder("DRB1*26F~28E~30Z") :
Error - zero alleles match this motif. Please try again.
dataSubset manipulates the Solberg dataset to prepare for heatmap generation by adding in a new column containing locus*allele information, reordering the dataset based on population name, and subsetting the dataset to only the locus of interest.
Partial depiction of Solberg Dataset before dataSubset - note population names are out of order, and the data consists of data from multiple HLA loci
>dataset[c(1,2,300,600, 1000),]
dataset popname contin complex latit longit coord locus allele_v2
1 lit Korean_1999 NEA 3 37.00N 127.30E 37.00N 127.30E DRB1 1302
2 lit Gabes_2006 NAF 2 34N 10E 34N 10E DRB1 1311
300 ihwg Ami_97 SEA 1 25.067N 121.65E 25.067N 121.65E B 5601
600 ihwg Brazilian OTH 3mig 10S 55W 10S 55W B 5001
1000 ihwg Chaouya NAF 2 35.5N 8W 35.5N 8W A 2902
allele_v3 allele.freq allele.count gametes
1 13:02 0.11001 111 1009
2 13:11 0.00526 1 190
300 56:01 0.20408 40 196
600 50:01 0.03371 6 178
1000 29:02 0.06716 9 134
Partial depiction of Solberg Dataset after dataSubset - note population names are now in lexicographic order, the only locus of note in the dataset is the locus provided in the motif (i.e. DRB1), and there is a new column, locus_allele, for comparison to the dataframe produced by AA_segments_maker.
>dataSubset(dataset, "DRB1*26F~28E~30Y")[c(1,2,300,600, 1000),]
dataset popname contin complex latit longit coord locus allele_v2
8615 lit Ache_2003 SAM 1 24S 56W 24S 56W DRB1 0403
8616 lit Ache_2003 SAM 1 24S 56W 24S 56W DRB1 0411
9365 lit Bari_1994 SAM 2 10N 70W 10N 70W DRB1 0403
10068 lit Cameroon_2001b SSA 3 6.00N 12.00E 6.00N 12.00E DRB1 1102
10679 lit Czech_1992 EUR 3 49.45N 15.30E 49.45N 15.30E DRB1 0301
allele_v3 allele.freq allele.count gametes locus_allele
8615 04:03 0.00578 1 173 DRB1*04:03
8616 04:11 0.74566 129 173 DRB1*04:11
9365 04:03 0.03636 4 110 DRB1*04:03
10068 11:02 0.03175 8 252 DRB1*11:02
10679 03:01 0.11111 21 189 DRB1*03:01
coordinate_converter converts coordinates in the Solberg dataset with a cardinal direction to its appropriate numerical enumeration for plotting on a map. North (N), which is above the equator, and East (E), which is east of the prime meridian are positive, while South (S), below the equator, and West (W), west of the prime meridian, are negative.
Partial depiction of Solberg Dataset before coordinate_converter
popname contin latit longit
8615 Ache_2003 SAM 24S 56W
101 Algerian_99 NAF 35N 0.5W
17938 BrazilGuaraniNandeva SAM 23S 54W
Partial depiction of Solberg Dataset after coordinate_converter
>coordinate_converter(heatmapdata)
popname contin latit longit
8615 Ache_2003 SAM -24 -56
101 Algerian_99 NAF 35 -0.5
17938 BrazilGuaraniNandeva SAM -23 -54
countSpaces splits a given string, and white spaces are counted in between split elements. This function was used to count whitespaces in the alignment data downloaded from the ANHIG Github repository to determine the starting amino acid position. See below:
Prot -30 1
| |
A*01:01:01:01 MAVM APRTLLLLLS GALALTQTWA GSHSMRYFFT SVSRPGRGEP RFIAVGYVDD
A*01:01:01:02N ---- ---------- ---------- ---------- ---------- ----------
A*01:01:01:03 ---- ---------- ---------- ---------- ---------- ----------
A*01:01:01:04 ---- ---------- ---------- ---------- ---------- ----------
A*01:01:01:05 ---- ---------- ---------- ---------- ---------- ----------
A*01:01:01:06 ---- ---------- ---------- ---------- ---------- ----------
Alignment files from the ANHIG repository are formatted such that a certain number of characters must be present in a given line to maintain the alignment structure, so the start position is not immediately provided in the file. To avoid hard coding a start position, countSpaces was used to calculate the number of spaces between the pipe, which is right underneath the starting count (-30). In order to accomplish this, the following items were needed: 1)the number of spaces between the word “Prot” and starting count, “-30”, 2)the number of spaces from the end of the first allele name to the beginning of the amino acid start position 3)the number of characters in the first allele where the number of spaces between -30 and the beginning of the first amino acid were calculated by adding item 3 with item 2, and subtracting the sum of that from item 1.
#number of whitespaces counted for row 3 in the alignment file
>countSpaces(alignment[[1]][3])
[1] 1 11 1 1 1 1 1 1 1 1 1
PALM produces a heatmap from the manipulated Solberg dataset by using the Generic Mapping Tools (GMT) R Package, which is an interface between R and the GMT Map-Making software. GMT software is required for this function and can be downloaded at https://www.soest.hawaii.edu/gmt/. GMT commands are called via a Bash script through using the gmt.system() function in the GMT R package. The final output of PALM is a heatmap saved as a JPEG in the user’s working directory.
Two default parameters are set for PALM: 1) color = TRUE Should the heatmap produced be in color? The default for color is set to true for a color heatmap – for a map in grayscale, set color==FALSE. 2) filter_migrant = TRUE Should OTH and migrant populations be filtered out of the dataset? The default for filter_migrant is set to true – set this parameter equal to FALSE to keep migrant and OTH populations in the dataset.
#JPEG with color
>PALM(Solberg_datset, "DRB1*26F~28E~30Y" )
…
#JPEG without color
>PALM(Solberg_datset, "DRB1*26F~28E~30Y", color=FALSE)